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ABSTRACT 

We  study  the  problem  of  mining  and  summarizing  multiple 
time  series  effectively  and  efficiently.  We  propose  PLiF,  a 
novel  method  to  discover  essential  characteristics  (“finger¬ 
prints”),  by  exploiting  the  joint  dynamics  in  numerical  se¬ 
quences.  Our  fingerprinting  method  has  the  following  bene¬ 
fits:  (a)  it  leads  to  interpretable  features;  (b)  it  is  versatile : 
PLiF  enables  numerous  mining  tasks,  including  clustering, 
compression,  visualization,  forecasting,  and  segmentation, 
matching  top  competitors  in  each  task;  and  (c)  it  is  fast 
and  scalable,  with  linear  complexity  on  the  length  of  the 
sequences. 

We  did  experiments  on  both  synthetic  and  real  datasets, 
including  human  motion  capture  data  (17MB  of  human  mo¬ 
tions),  sensor  data  (166  sensors),  and  network  router  traf¬ 
fic  data  (18  million  raw  updates  over  2  years).  Despite  its 
generality,  PLiF  outperforms  the  top  clustering  methods  on 
clustering;  the  top  compression  methods  on  compression  (3 
times  better  reconstruction  error,  for  the  same  compression 
ratio);  it  gives  meaningful  visualization  and  at  the  same 
time,  enjoys  a  linear  scale-up. 

1.  INTRODUCTION 

Time  sequences  appear  in  countless  applications,  like  sen¬ 
sor  measurements  [13],  mobile  object  tracking  [20],  data  cen¬ 
ter  monitoring  [27],  motion  capture  sequences  [19,  21],  en¬ 
vironmental  monitoring  (like  chlorine  levels  in  drinking  wa¬ 
ter  [25])and  many  more.  Given  multiple,  interacting  time 
sequences,  how  can  we  group  them  according  to  similar¬ 
ity?  How  can  we  find  compact,  numerical  features  (“ finger¬ 
prints ”),  to  describe  and  distinguish  each  of  them? 

Researches  in  time  sequences  form  two  broad  classes: 

(a)  Feature  extraction  (and  similarity  search,  indexing  etc), 
using,  say,  Fourier  or  wavelet  coefficients,  piece-wise 
linear  approximations  and  similar  methods  and 

(b)  forecasting,  like  an  autoregressive  integrated  moving 
average  model  (ARIMA)  and  related  methods. 
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The  former  class  is  useful  for  querying:  indexing,  similarity 
searching,  clustering.  The  latter  class  is  useful  for  mining: 
forecasting,  missing  value  imputation,  anomaly  detection. 
Can  we  develop  a  method  that  has  the  best  of  both  worlds? 
Extracting  the  essence  of  time  sequences  is  already  very  use¬ 
ful  -  it  would  be  even  more  useful  if  those  features  are  easy 
to  interpret,  and  even  better  if  they  could  help  us  do  fore¬ 
casting.  Ability  to  forecast  automatically  leads  to  anomaly 
detection  (every  time-tick  that  deviates  too  much  from  our 
forecast),  segmentation  (a  time  interval  deviating  too  much 
from  our  forecast),  compression  (storing  the  deltas  from  the 
forecasts),  and  missing  value  imputation,  extrapolation  and 
interpolation.  And  of  course,  we  would  like  the  method  to 
be  scalable,  with  linear  complexity  on  the  length  of  the  se¬ 
quences.  Is  it  possible  to  achieve  as  many  as  possible  of  the 
above  goals,  any  of  which  alone  is  already  very  useful?  The 
proposed  PLiF  method  gives  a  positive  answer:  the  idea  is  to 
extract  the  essential  numerical  representation  that  charac¬ 
terizes  the  evolving  dynamics  of  the  sequences,  specifically, 
to  fit  a  Linear  Dynamical  System  (LDS)  on  the  collection  of 
m  sequences,  and  then  we  show  how  to  extract  a  few,  but 
meaningful  features  out  of  the  LDS.  We  will  refer  to  those 
features  as  the  “ fingerprints ”  of  each  sequence.  We  further 
show  that  the  proposed  fingerprints  achieve  all  the  above 
goals: 

1.  Effectiveness:  the  resulting  features  lead  to  a  natural 
distance  function,  which  agrees  with  human  intuition 
and  the  provided  ground  truth.  Thus,  fingerprints  lead 
to  good  clustering  as  well  as  visualization  (see  Fig.  1(d) 
and  Fig.  5); 

2.  Interpretability :  as  we  will  show,  the  fingerprints  cor¬ 
respond  to  groups  of  harmonics; 

3.  Forecasting:  they  naturally  lead  to  forecasting  and 
compression; 

4.  Scalability:  the  proposed  PLiF  method  is  fast  and  scal¬ 
able  on  the  size  of  the  sequences. 

Table  1  compares  PLiF  against  traditional  methods  and 
illustrates  their  strengths  and  shortcomings:  a  checkmark 
(/)  indicates  that  the  corresponding  method  (column)  ful¬ 
fills  the  corresponding  requirement  (row).  Only  PLiF  has  all 
entries  as  checkmarks.  In  more  detail  (also  see  Appendix  B), 
the  desirable  fingerprints  should  allow  for  lags,  and  small 
variations  in  frequency.  While, 

•  Fourier  analysis  and  wavelet  methods  could  identify 
the  frequencies  in  a  single  signal,  but  can  not  cross¬ 
correlate  similar  signals,  nor  do  forecasting1. 

1One  might  argue  that  Fourier  coefficients  can  do  rudimen- 
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(b)  z- value  of  right  foot  (d)  fingerprints:  FP1  vs 
marker  FP2 

Figure  1:  Motion  capture  (mocap)  sequences:  sam¬ 
ple  data  and  visualization.  1(a):  a  film-strip  of  a 
human  motion.  1(b):  right  foot  position  for  a  walk¬ 
ing  motion  (top),  and  a  running  one  (bottom).  1(c): 
Scatter-plot  of  two  principal  components  for  walk¬ 
ing  (blue  circles)  and  running  sequences  (red  stars), 
without  clear  separation.  1(d):  Scatter-plot  of  the 
“fingerprints”  (FP)  by  PLiF  -  first  FP  versus  sec¬ 
ond  FP,  for  all  49  mocap  sequences.  Notice  the 
near-perfect  separation  of  the  walking  motions  (blue 
circles),  from  the  running  ones  (red  stars). 

•  Singular  value  decomposition  (SVD)  and  its  “centered” 
version,  principal  component  analysis  (PCA),  do  cap¬ 
ture  correlations  (by  doing  soft  clustering  of  similar  se¬ 
quences)  and  thus  derive  hidden  (“latent”)  variables, 
but  they  can  not  do  forecasting  either,  nor  is  it  easy 
to  interpret  the  derived  hidden  variables. 

•  Standard  Linear  Dynamical  Systems  (LDS)  and  Kalman 
filters  can  capture  correlations,  as  well  as  do  forecast¬ 
ing.  However,  the  resulting  hidden  variables  are  hard 
to  interpret  (see  Fig.  2(c)),  and  they  do  not  lead  to  a 
natural  distance  function. 

Finally,  we  do  not  show  typical  distance  functions  for  time 
series  clustering  in  Table  1:  the  Euclidean  distance  (sum  of 
squares  of  differences  at  each  time-tick)  and  the  Dynamic 
Time  Warping  (DTW)  distance.  The  reason  is  that  none 
of  them  leads  to  forecasting,  nor  to  feature  extraction,  and 
thus  the  interpretability  requirement  is  out  of  reach.  More¬ 
over,  the  typical,  un-constrained,  version  of  DTW  fails  the 
scalability  requirement,  being  quadratic  on  the  length  of  the 
sequences. 

To  make  the  discussion  more  concrete,  we  refer  to  one 
of  our  motivating  applications,  motion  capture  [mocap)  se¬ 
quences:  For  each  motion,  we  have  93  numbers  (positions 
or  angles  of  markers  on  the  joints  of  the  actor),  with  a 
variable  number  of  frames  (=time-ticks)  for  each  such  se¬ 
quence,  typically  60-120  frames  per  second.  See  Fig.  1(b) 
for  two  such  example  sequences,  both  plotting  the  right- 
foot  z-value  as  a  function  of  time,  for  one  of  the  walking 

tary  forecasting,  since  they  can  generate  values  for  time  ticks 
outside  the  initial  time  range  (1,  . . . ,  T);  however,  these  val¬ 
ues  will  just  lead  to  repeating  the  initial  signal,  with  ringing 
phenomena  if  the  signal  has  a  trend. 


Table  1:  Capabilities  of  Approaches.  Only  PLiF 
meets  all  specs'5. 


SVD/ 

PCA 

DFT/ 

DWT 

LDS 

PLiF 

Correlation  Discovery 

/ 

/ 

/ 

Interpretability 

/ 

/ 

Forecasting 

/ 

/ 

motions,  and  one  of  the  running  ones,  from  the  publicly 
available  mocap .  cs  .  emu .  edu  repository  of  mocap  sequences. 

Given  a  large  collection  of  such  human  motion  sequences, 
we  want  to  find  clusters  and  to  group  similar  motions  to¬ 
gether.  The  desirable  features/fingerprints  would  have  the 
following  properties: 

•  PI:  Lag  independence:  two  walking  motions  should  be 
grouped  together  even  if  they  start  at  different  footstep 
or  phase; 

•  P2:  Frequency  proximity:  running  motions  with  nearby 
speed  of  motion  should  be  grouped  together; 

•  P3:  Harmonics  grouping:  Several  sensor  measurements, 
like  human  motion,  human  voice,  automobile  traffic, 
obey  several  periodicities,  often  with  related  frequen¬ 
cies  ( “harmonics” ) .  We  would  like  to  detect  such  groups 
of  harmonics. 

Fig.  1(d)  gives  a  quick  preview  of  the  visualization  and 
effectiveness  of  the  proposed  PLiF  method:  For  the  49  se¬ 
quences  we  have,  we  map  each  to  its  two  fingerprint  values, 
thus  making  it  a  2-d  point.  Those  49  points  are  shown  in 
Fig.  1(d),  using  ‘stars’  for  motions  that  were  (manually) 
labeled  as  running  motions,  and  circles  for  the  walking  mo¬ 
tions.  Notice  how  clearly  the  two  groups  can  be  separated 
by  a  vertical  line  at  x=0. 

The  rest  of  the  paper  is  organized  in  the  typical  way:  In 
the  upcoming  sections,  we  give  the  problem  definition  and 
a  running  example,  then  we  list  the  shortcomings  of  earlier 
methods,  the  description  of  PLiF,  experiments,  related  work 
and  conclusions. 

2.  PROBLEM  DEFINITION  &  RUNNING  EX¬ 
AMPLE 

For  the  sake  of  exposition,  we  provide  an  arithmetic  ex¬ 
ample  here  to  demonstrate  our  proposed  PLiF  method.  As 
mentioned  in  the  introduction,  the  problem  is  as  follows: 

Problem  1  (Time  sequence  fingerprinting).  Given 
m  time  sequences  of  length  T ,  Extract  features  that  match 
the  four  goals  in  the  introduction. 

The  four  goals  are  that  (a)  the  features  should  be  effective, 
capturing  the  essence  of  what  humans  consider  in  similar  se¬ 
quences;  (b)  interpretable  (c)  they  should  lead  to  forecasting 
and  (d)  their  computation  should  be  fast  and  scalable. 

More  specifically,  for  the  first  goal  of  effectiveness,  we  want 
the  fingerprints  to  have  properties  P1-P3,  namely,  lag  inde¬ 
pendence,  frequency  proximity  and  harmonics  grouping. 

Thus,  we  use  the  following  illustrative  sequences  (see  Fig¬ 
ure  2(a)),  of  length  T=500  time-ticks,  as  defined  in  Table  2. 

3  Scalability  has  not  been  shown  as  all  methods  here  have 
computation  time  linear  to  the  length  of  data  sequences. 


Table  2:  Illustrative  sequences 


Equation 

Comment 

(a)  X\  =  sin(27rt/100) 

period  100 

(b)  A'2  =  cos(2-7rf/100) 

time-shifted  of  (a) 

(c)  X3  =  sin(27rf/100)  + 
cos(27rf/100) 

time-shifted  & 
higher  amplitude 

(d)  A4  =  sin(27rt/110)  + 

0.2  sin(27rt/30) 

mixture  of  two  waves 
of  periods  110  and  30 

(e)  X5  =  cos(27rt/110)  + 

0.2  sin(27rt/30  +  rr/4) 

same  as  (d)  but  lag  in 
both  components 

The  first  three  sequences  have  period  100,  with  differing 
lags  and  amplitudes  and  thus  we  would  like  them  to  fall  into 
the  same  group.  The  remaining  two  combine  two  frequencies 
(with  periods  110  and  30),  and  a  small  phase  difference; 
according  to  the  P1-P3  properties,  we  would  expect  them 
to  form  another  group  of  their  own. 

Fig.  2(a)  also  shows  the  first  sequence,  in  dashed  line 
form,  so  that  we  can  visually  compare  the  five  sequences.  As 
mentioned  in  the  introduction,  and  as  we  elaborate  in  Ap¬ 
pendix  B,  the  typical  method  for  dimensionality  reduction 
(and  thus  feature  extraction)  is  PCA/  SVD;  and  the  typ¬ 
ical  method  for  forecasting  is  autoregression  and  its  more 
general  form,  Linear  Dynamical  Systems  (LDS,  where  we 
specifically  use  the  output  matrix).  We  show  the  resulting 
features  for  each  method  as  gray-scale  heat-maps  (Fig.  2(b)- 
2(d)),  where  rows  are  sequences,  columns  are  features  (= 
fingerprints),  and  black  color  indicates  high  values.  Rows 
that  are  visually  similar  means  that  they  have  similar  fea¬ 
ture  values  and  thus  would  end  up  in  the  same  cluster. 

Interpreting  the  heat-maps  or  “Why  not  PCA  or 
LDS?”.  In  short,  only  PLiF  gives  effective  features.  Specif¬ 
ically,  notice  that  PCA  (Fig.  2(b))  yields  similar  rows  for 
sequence  (a)  and  (e)  -  they  are  indeed  visually  similar  in 
their  time-plots,  too,  with  small  Euclidean  distance  (sum  of 
square  of  daily  differences).  This  is  not  surprising,  because 
PCA  and  SVD  actually  preserve  the  Euclidean  distance  as 
best  as  possible  -  except  that  the  Euclidean  distance  fails 
our  desirable  goals,  heavily  penalizing  lags.  Similarly  us¬ 
ing  the  output  matrix  C  from  LDS  (Fig.  2(c))  gives  a  poor 
grouping. 

The  heat-maps  above  explain  why  both  PCA/SVD,  as 
well  as  LDS,  will  lead  to  poor  clustering,  after  we  apply,  say, 
k-means  [11].  In  contrast,  the  heat-map  of  our  proposed 
method  PLiF  (Fig.  2(d))  gives  the  expected  groupings:  the 
last  two  sequences  are  clearly  together,  with  dark  color  in 
their  first  column;  and  the  first  three  sequences  have  dark 
color  in  their  second  column. 

As  we  show  in  later  sections,  PLiF  also  gives  the  inter¬ 
pretation  for  each  feature:  the  corresponding  “harmonic” 
groups  and  the  strength  of  them  in  each  of  the  sequences 
(Fig.  3(c)).  Although  the  above  is  synthetic  data,  such  lag 
correlation  and  frequency  combinations  are  common  in  time 
series  data  such  as  motion  capture  data  and  sensor  data. 

3.  PROPOSED  METHOD:  PLiF 

We  describe  our  proposed  method  PLiF  (Parsimonious 
Linear  Fingerprinting)  in  this  section.  First  we  give  the 
basic  variation  (PLiF-naive)  and  explain  its  steps  and  in 
Appendix  C.3  we  show  how  to  make  it  even  faster.  Table  3 


(c)  LDS 


(d) 

PLiF 


Figure  2:  Running  example:  5  synthetic  sequences 
(top  3,  with  period  100  and  possibly  shifts;  rest 
2,  with  periods  110  and  30.  2(a):  the  time  plots. 
2(b), 2(c), 2(d):  ‘heat-maps’  of  fingerprints  for  each 
sequence,  using  PCA,  LDS  and  PLiF,  respectively. 
PLiF  gives  similar  fingerprints  for  the  top  3  and  the 
bottom  2  sequences  respectively,  while  competitors 
do  not. 


Table  3:  Symbols 


m 

number  of  sequences 

T 

duration  (length)  of  sequences 

h 

number  of  hidden  variables  for  each  time  tick 

h1 

number  of  harmonics  excluding  conjugates 

X 

data  matrix,  m  x  T 

Zn 

hidden  variables  for  time  n,  h  x  1  vector 

fio 

initial  state  for  hidden  variable,  h  x  1  vector 

A 

transition  matrix  (like  “Newton  dynamics”),  hxh 

C 

output  matrix  (“hidden  to  observation”)  m  x  h 

V 

compensation  matrix,  eigenvectors  of  A,  h  x  h 

A 

eigen-dynamics  matrix  (eigenvalues  of  A),  h  x  h 

ch 

harmonic  mixing  matrix,  m  x  h 

cm 

harmonic  magnitude  matrix,  m  x  h' 

F 

fingerprints 

gives  an  overview  of  the  symbols  used  and  their  definitions. 

Goal.  To  recap,  we  want  to  solve  Problem  1,  Given  multi¬ 
ple  time  sequences  of  same  duration  T,  find  features  which 
have  the  four  properties  namely:  (a)  Effective  (can  be  used 
for  visualization  and  clustering);  (b)  Meaningful;  (c)  Gen¬ 
erative  (can  be  used  for  forecasting  and  compression);  and 
(d)  Scalable. 

Each  following  subsection  describes  a  step  in  our  algo¬ 
rithm.  At  the  high  level,  PLiF  (a)  discovers  the  “Newton” - 
like  dynamics,  using  a  modified,  faster  way  of  learning  an 
LDS;  (b)  normalizes  the  resulting  transition  matrix  A,  which 
reveals  the  natural  frequencies  and  exponential  decays  /  ex¬ 
plosions  of  the  given  set  of  sequences  (which  we  refer  to 
as  harmonics,  see  Definition  1);  and  (c)  groups  some  of 
those  harmonics/hidden  variables,  after  ignoring  the  phase, 
thus  accounting  for  lag-correlations.  The  discovered  groups 
of  such  frequencies  are  exactly  the  “fingerprints”  (features) 
that  PLiF  is  using  for  clustering,  visualization,  compression, 
etc. 


3.1  Learning  Dynamics 

Intuition.  To  understand  the  hidden  pattern  in  the  multi¬ 
ple  signals,  we  want  to  extract  the  hidden  dynamics,  similar 


to  “Newtonian”  dynamics  like  velocities  and  accelerations. 
Our  basic  intuition  is  to  assume  that  there  is  a  series  of 
hidden  variables,  representing  the  states  of  the  hidden  pat¬ 
tern,  which  are  evolving  according  to  a  linear  transformation 
and  are  linearly  transformed  to  the  observed  numerical  se¬ 
quences. 

Theory.  To  obtain  the  dynamics  in  data,  we  use  an  un¬ 
derlying  Linear  Dynamical  System  (LDS)  to  model  multiple 
time  series  (Eq.  11  and  12).  We  use  the  EM  algorithm  (de¬ 
scribed  in  Appendix  B.3)  for  learning  the  model  parameters. 

Si  =  po+noise  zn+ 1  =  A  zn+noise  xn  =  Czn+noise 

(n  =  1, . . . ,  T). 

The  LDS  model  includes  parameters  of  an  initial  state 
vector  p0,  a  transition  matrix  A  and  an  output  matrix  C 
(along  with  the  noise  covariance  matrices).  Similar  to  “New¬ 
tonian”  dynamics,  the  transition  matrix  A  will  predict  the 
hidden  variables  (like  the  velocity  and  acceleration  in  hu¬ 
man  motions)  for  the  next  time  tick,  and  the  output  matrix 
C  will  tell  us  how  the  hidden  variables  (e.g.  velocities  and 
accelerations)  map  to  the  observed  values  (e.g.  positions)  at 
each  time  tick  (each  row  of  C  corresponds  to  one  sequence). 

Note  that  as  discussed  before,  the  transition  matrix  A  is 
not  unique:  it  is  subject  to  permutation,  rotation  and  linear 
combinations,  and  so  is  the  output  matrix  C.  Thus  each 
row  in  C  can  not  uniquely  identify  the  characteristics  of  the 
corresponding  series.  Our  subsequent  steps  are  motivated 
by  this  observation. 

Example.  On  using  h  =  6  hidden  variables  to  learn  the  pa¬ 
rameters  for  the  5  synthetic  sequences  shown  in  Figure  2(a), 
we  will  get  a  6  x  6  transition  matrix  A,  a  5  x  6  output  matrix 
C  and  a  6  x  1  initial  state  vector  po.  Figure  2(c)  shows  the 
C  matrix.  Clearly,  it  is  all  jumbled  up  and  doesn’t  show  any 
clear  pattern  w.r.t.  the  sequences. 

3.2  Canonicalization 

Intuition.  Equations  of  the  linear  system  (see  Appendix  B, 
Eq.  11)  tell  that  the  hidden  variables  (zjj)  can  have  only  a 
limited  number  of  modes  of  operation  that  depend  on  the 
eigenvalues  of  the  A  matrix:  The  behavior  can  be  expo¬ 
nential  decay  (real  eigenvalue,  with  magnitude  less  than  1), 
exponential  growth  (real  eigenvalue,  with  magnitude  greater 
than  1),  sinusoidal  periodicity  of  increasing  /  constant  /  de¬ 
creasing  amplitude  (complex  eigenvalue  a  +  bi  controlling 
both  amplitude  and  frequency)  and  mixtures  of  the  above. 
Those  eigenvalues  directly  capture  the  amplitude  and  fre¬ 
quencies  of  the  underlying  signals  of  hidden  variables,  which 
we  refer  to  as  harmonics  (Definition  1).  Our  goal  in  this  step 
is  to  identify  the  canonical  form  of  the  hidden  harmonics  and 
how  they  mix  in  the  observation  sequences. 

Theory.  We  know  that  a  set  of  similar  matrices  share  the 
same  eigenvalues  [9].  Hence,  we  propose  to  perform  the 
eigenvalue  decomposition  of  the  transition  matrix  A,  and 
obtain  the  corresponding  eigen-dynamics  matrix  and  eigen¬ 
vectors. 

A  =  VAV*  (1) 

where  V  *  V*  =  I.  The  matrix  V  contains  the  eigenvectors 
of  A  and  A  is  a  diagonal  matrix  of  eigenvalues  of  A.  We 


can  justify  doing  the  decomposition  because  over  C  almost 
every  matrix  is  diagonalizable.  Specifically,  the  probability 
that  a  square  matrix  of  arbitrary  fixed  size  with  real  entries 
is  diagonalizable  over  C  is  1  (see  Zhang  [32]).  Also  without 
loss  of  generality,  we  assume  the  eigenvalues  are  grouped 
into  conjugate  pairs  (if  any)  and  ordered  according  to  their 
phases. 

Note  that  the  output  matrix  C  in  LDS  represents  how  the 
hidden  variables  translate  into  observation  sequences  with 
linear  combinations.  In  order  to  obtain  the  same  observa¬ 
tion  sequences  from  A  as  the  transition  matrix,  we  need  to 
compensate  the  output  matrix  C  to  get  the  harmonic  mixing 
matrix  C  h- 

Ch  =  C-V  (2) 

Similarly,  the  canonical  hidden  variables  will  be: 

Mo  =  V  •  p0  (3) 

Zn  =  V  •  Zn  (4) 

The  following  two  lemmas  tell  how  the  harmonic  mixing 
matrix  C  /,  looks  like  and  how  the  canonical  hidden  variables 
correspond  to  the  original  dynamical  system: 

Lemma  1.  V  has  conjugate  pairs  of  columns  correspond¬ 
ing  to  the  conjugate  pairs  of  eigenvalues  in  A.  Hence,  the 
harmonic  mixing  matrix  Cj,  must  contain  conjugate  pair  of 
columns  corresponding  to  the  conjugate  pairs  of  the  eigen¬ 
values  in  A. 

Proof  Sketch:  See  Appendix. 

Lemma  2. 

-(new  *  n—  1  -»new  .  /r\ 

zn  =  A  ■  Po  +  noise  (5) 

xn  =  C h  ■  A"1-1  •  pSew  +  noise  (6) 

Proof  Sketch:  See  Appendix. 

The  intuition  is  that  all  hidden  variables  z„,  all  canoni¬ 
cal  hidden  variables  zffw ,  and  all  observations  xn  (n  = 
1 ,....,  T)  are  mixtures  of  a  set  of  growing,  shrinking  or  sta¬ 
tionary  sinusoid  signals,  of  data-dependent  frequencies;  we 
refer  to  those  signals  as  “harmonics'" ,  and  their  character¬ 
istic  frequencies  and  amplitudes  are  completely  defined  by 
the  eigenvalues  of  the  transition  matrix  A.  “Harmonics”  are 
formally  defined  as: 

Definition  1.  A  signal  yn  is  said  to  be  a  harmonic  if  it 
is  in  the  form  of  yn  =  ( a+bi)n ,  where  a+bi  is  an  eigenvalue 
of  A  and  i  =  \f— T. 

Our  definition  of  harmonic  is  related  to  the  frequencies 
that  the  Fourier  transform  would  discover,  with  two  major 
differences:  (a)  exponential  amplitude:  our  harmonic  func¬ 
tions  could  be  growing  or  shrinking  exponentially  (for  1) 
(b)  generality:  the  frequencies  of  the  Discrete  Fourier  Trans¬ 
form  (DFT)  are  always  multiples  of  the  base  frequency  1/T, 
while  our  harmonics  could  have  any  arbitrary  frequency  (6 
could  take  any  value  that  fits  the  given  data  sequences). 

Example.  On  performing  this  step  on  the  learned  tran¬ 
sition  matrix  from  the  5  synthetic  sequences,  we  will  get  a 
6x  6  A  matrix  with  eigenvalues  0.998±0.057i,  0.998±0.063i, 
and  0.978  ±  0.208i  on  the  diagonal.  From  our  above  discus¬ 
sion,  we  know  that  when  the  real  part  of  the  eigenvalue  = 
1,  then  the  signal  is  a  sinusoid  -  which  is  the  case  here.  The 


imaginary  part  on  the  other  hand  corresponds  to  the  fre¬ 
quencies.  Thus  0.057  «  27r/110  corresponds  to  frequency 
1/110,  0.063  ss  27t/100  corresponds  to  frequency  1/100  and 
0.208  R3  2-7t/30  corresponds  to  frequency  1/30  -  which  are 
precisely  the  base  frequencies  in  the  data. 

Figure  3(a)  shows  the  matrix  C h  obtained  for  the  sam¬ 
ple  signals.  We  have  shown  the  entries  in  the  standard 
polar  form:  Ae ^ ,  A  is  the  magnitude  and  <j>  is  the  angle 
(phase).  For  clarity,  the  values  which  are  very  small  have 
been  shown  as  0  in  the  matrix.  Firstly  as  expected  from 
Lemma  1,  we  have  conjugate  pairs  of  columns  correspond¬ 
ing  to  the  eigenvalues  which  correspond  to  the  frequencies. 
Secondly,  note  that  signals  (a)  and  (b)  are  the  same  sinu¬ 
soid  but  with  just  different  phases  (specifically  a  phase  dif¬ 
ference  of  7t/2).  Hence  in  the  matrix  C /,  they  have  the  same 
frequency  components  (high  values  only  in  the  conjugate 
columns  3  and  4)  with  the  same  weights  (same  A  value) 
but  with  different  phases  (different  <f>  values).  So,  if  we  di¬ 
rectly  try  to  cluster  Ch,  we  will  not  place  them  in  the  same 
cluster.  Thirdly,  the  phase  difference  is  also  preserved  in 
Ch-  0.82  +  0.75  =  1.5708  =  n/2  =  the  phase  difference  be¬ 
tween  signals  (a)  and  (b).  This  can  also  be  verified  for  the 
signals  (d)  and  (e)  which  have  different  phases  for  the  two 
constituent  frequencies. 

3.3  Handling  Lag  Correlation:  Polar  Form 

Intuition.  As  specified  in  Section  1,  the  ideal  features  should 
catch  lag  correlation.  After  computing  the  harmonic  mixing 
matrix  Ch,  we  have  found  the  contribution  of  each  harmonic 
in  the  resulting  observation  sequences.  Each  row  in  C /,  rep¬ 
resents  the  characteristics  of  each  data  sequence  in  the  do¬ 
main  of  the  harmonics.  Thus  Ch  can  plausibly  be  used  to 
cluster  the  sequences.  However,  the  harmonic  mixing  ma¬ 
trix  not  only  tells  the  strength  of  each  eigen-dynamic  but 
will  also  encode  the  required  phases  for  different  sequences. 
Thus  we  will  fail  to  group  similar  motions  just  due  to  the 
lag  or  phase  differences.  Intuitively  for  example,  suppose  we 
have  two  almost  identical  walking  motions,  except  that  one 
starts  from  the  left  foot  and  another  from  the  right  foot. 
We  want  to  extract  features  that  could  identify  the  walking 
behavior,  no  matter  which  foot  it  starts  with,  so  that  we 
would  be  able  to  group  the  two  walking  motions  together. 

Theory.  We  eliminate  phase  by  taking  the  magnitude  of 
the  harmonic  mixing  matrix  Ch  abs(Ch).  From  Lemma  1 
we  will  get  the  same  column  for  those  conjugate  columns  of 
Ch',  we  drop  these  duplicate  columns  to  get  the  harmonic 
magnitude  matrix  Cm-  The  harmonic  magnitude  matrix 
Cm  tells  how  strong  each  base  harmonic  participates  in  the 
observation  time  sequences  and  naturally  leads  to  lag  inde¬ 
pendent  features  (PI,  Sec.  1). 

Lemma  3.  abs{Ch)  contains  pairs  of  identical  columns. 

Example.  Figure  3  (b)  and  (c)  show  the  matrix  C„,  ob¬ 
tained  after  applying  this  step  on  the  generated  Ch  matrix 
for  the  synthetic  signals.  Note  that  Cm  has  begun  to  show 
some  clear  patterns  corresponding  to  the  underlying  true 
clusters. 

3.4  Grouping  Harmonics 

Intuition.  The  harmonic  magnitude  matrix  Cm  captures 
the  contributing  coefficients  of  each  individual  frequency. 


As  we  find  harmonic  frequency  sets  in  music,  in  real  time- 
series  like  motions,  we  will  expect  to  usually  find  motion 
sequences  composed  of  several  major  frequencies.  Hence  we 
now  want  to  find  such  harmonics  groups  (P3  as  stated  in 
Section  1)  capable  of  describing  common  characteristics  of 
similar  motion  sequences,  and  the  corresponding  representa¬ 
tions  of  each  sequence  in  such  harmonics  group  space.  As  a 
concrete  example,  say  we  want  to  determine  that  walking  se¬ 
quences  are  composed  of  10  units  of  frequency  1  and  1  units 
of  frequency  2,  while  running  motions  have  say  10  units  of 
frequency  2  and  1  units  of  frequency  3.  Furthermore,  a  fast 
walking  motion  may  in  fact  have  a  proper  mix  of  both  a 
walking  frequency-group  and  running  frequency-group. 

Theory.  To  achieve  this  goal,  we  can  use  any  dimensional 
reduction  method  such  as  SVD/PCA,  ICA  or  nonnegative 
matrix  factorization.  For  simplicity,  we  take  the  singular 
value  decomposition  (SVD)  of  the  harmonic  magnitude  ma¬ 
trix  Cm.  As  we  introduced  earlier,  SVD  is  capable  of  finding 
low  rank  projection  of  the  data  matrix.  Cm  ~  Uk  •  Sk  ■  Vjf 
where  Cm  is  column  centered  from  Cm,  Uk  and  Vk  are 
orthogonal  matrices  with  k  columns,  and  Sk  is  a  diagonal 
matrix.  The  diagonal  of  Sk  contains  k  singular  values  which 
are  usually  sorted  by  magnitude.  Finally,  we  obtain  the 
features  as  follows: 

F  =  Uk  •  Sk  (7) 

Example.  Figure  2(d)  shows  the  final  F  matrix  obtained 
from  the  sample  sequences.  Note  that  this  matrix  very 
clearly  brings  out  the  correct  groupings.  Also  notice  that 
in  the  corresponding  Cm  matrix,  sequences  (d)  and  (e)  had 
high  components  on  columns  1  and  3  (which  map  to  the  2 
frequencies  generating  those  signals).  But  after  doing  SVD, 

F  gives  us  a  clearer  and  simpler  picture  where  they  are 
shown  to  be  more  related  by  having  the  same  ’group’  of 
harmonics  combined  in  the  same  way. 

3.5  Discussion 

Choosing  h:.  A  particular  issue  in  the  learning  algorithm 
is  choosing  a  proper  number  h  for  the  hidden  dimension  of 
underlying  LDS.  In  practice,  we  use  the  “80%-95%”  energy 
criterion  to  determine  h  [5].  That  is,  we  take  the  Singular 
Value  Decomposition  of  X,  rank  the  singular  values,  and 
then  choose  h  at  the  rank  with  95%  of  the  total  sum  of 
squared  singular  values:  h  <—  argh  ,  where  sfs  are 

singular  values  of  X  in  descending  order. 

Complexity: 

Lemma  4.  The  straightforward  implementation  of  the  al¬ 
gorithm  (refered  to  as  PLiF-naive)  costs  0(jf  iteration  •  T  • 

(m3  +  h3)),  where  jf  iteration  is  the  number  of  iterations  for 
learning  LDS. 

However,  using  the  Woodbury  matrix  identity  [9]  and  in¬ 
crementally  computing  inverse  of  covariance  matrix  [25], 
the  complexity  can  be  reduced  dramatically  as  stated  in 
Lemma  5.  See  Appendix  C.3  for  details. 

Lemma  5.  PLiF  can  be  computed  within  time  of  iteration- 
(T  •  (m2  •  h  +  h3))  +  m  ■  h2). 

4.  EXPERIMENTAL  RESULTS 
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Figure  3:  Running  example:  the  synthetic,  sinusoidal  signals  of  Figure  2(a),  and  the  output  matrices  according 
to  PLiF:  (a)  The  harmonic  mixing  matrix  C/,  and  (b)  the  harmonic  magnitude  matrix  Cm  and  (c)  its  heat-map 
(darker  color  -  higher  value  in  that  cell).  Near-zero  values:  omitted  for  clarity.  Notice  that  (1)  the  columns 
of  (a)  are  complex  conjugates,  in  pairs;  (2)  the  harmonic  magnitude  matrix  Cm  makes  similar  sequences  to 
look  similar  (top  3,  bottom  2). 


Please  see  a  description  of  our  experimental  environment 
in  the  Appendix  C.2. 

4.1  Effectiveness:  Visualization 

As  stated  in  the  introduction  section,  our  proposed  method 
PLiF  is  capable  of  producing  meaningful  features:  each  fea¬ 
ture  column  corresponds  to  a  group  of  “harmonic”  frequen¬ 
cies  (one  or  more)  and  features  represent  the  participation 
coefficients  of  the  harmonic  group  in  the  sequence. 

For  the  MOCAP  dataset,  we  found  interpretable  and  inter¬ 
esting  patterns  in  its  fingerprints  (Fig.  4).  In  our  experi¬ 
ment,  we  use  hidden  dimension  h  =  7  as  suggested  by  the 
95%  criteria,  and  produce  two  fingerprints  for  each  sequence 
( k  =  2).  The  walking  motions  exhibit  strong  correlation 
with  harmonics  with  eigen-values  0.998  ±  0.053*,  equivalent 
to  the  frequency  of  1/119,  while  the  running  ones  are  cor¬ 
related  with  eigenvalues  1.007  ±  0.082*  and  0.989  ±  0.108*, 
equivalent  to  the  frequencies  of  1/78  and  1/58. 

We  already  presented  meaningful  features,  both  visually 
and  numerically,  extracted  from  multiple  sequences  by  our 
proposed  PLiF  method.  Thanks  to  those  features,  PLiF  can 
be  readily  used  for  almost  all  mining  tasks  for  time  series, 
namely  clustering,  compression,  forecasting  and  segmenta¬ 
tion.  While  forecasting  and  segmentation  are  straightfor¬ 
ward  brought  by  the  underlying  dynamical  system  of  our 
method,  we  will  focus  on  the  particular  application  of  PLiF 
in  time  series  clustering  and  compression. 

4.2  Effectiveness:  Clustering 

The  rationale  in  our  clustering  method  lies  in  the  fact  that 
the  fingerprints  (features)  computed  by  PLiF  characterize 
how  much  each  “harmonic”  group  participates  in  each  of  the 
sequences.  Essentially,  such  a  fingerprint  tells  the  projection 
of  each  sequence  onto  the  basis  of  the  “harmonic”  group. 
The  final  clustering  result  can  be  then  obtained  by  applying 
any  state-of-the-art  clustering  algorithm,  such  as  k-means 
or  spectral  clustering  [11]  (Chap  14.3.6  &  Chap  14.5.3)  on 
the  fingerprints. 

In  our  experiments,  we  use  simple  thresholding  (=0)  on 
the  first  fingerprint  (FP1)  to  tell  the  group,  equivalent  to 
k-means  on  FP1.  In  this  way  PLiF  can  produce  two  class 
grouping.  But  it  can  be  easily  extended  to  handle  multiple 
class  case,  through  the  hierarchical  framework  [11]  (Chap 
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Figure  4:  Mocap  fingerprints  and  visualization.  4(a) 
displays  several  sample  sequences,  top  two  of  which 
are  walking  (#15  and  #22),  followed  by  two  running 
ones  (#45  and  #38)  and  a  running-to-stop  motion 
(#8).  4(b):  Each  motion(row)  displays  two  finger¬ 
prints.  Upper  26  rows  are  walking  motion,  and  the 
rest  are  running  motion.  4(c):  Walking  motion  are 
in  blue  circles,  and  running  in  red  stars.  Note  the 
three  red  stars  close  to  circles  turn  out  to  be  abnor¬ 
mal  motions:  running  to  stop  (#8  and  #57),  and 
right  turn  (#43). 


14.3.12):  applying  PLiF-clustering  in  each  level  to  produce 
bi-clustering  and  further  dividing  in  proper  descendants. 

In  MOCAP  we  test  the  clustering  result  on  the  right  foot 
marker  position  with  sequences  of  equal  length  (T  =  107,  m  = 
49).  Since  we  know  the  true  labels  of  each  motion  in  MOCAP, 
we  adopt  a  standard  measure  of  conditional  entropy  from 
the  confusion  matrix  of  prediction  labels  against  true  labels 
to  evaluate  the  clustering  quality.  The  conditional  entropy 
(CE)  tells  the  difference  of  two  clustering  (lower  is  better), 
based  on  the  following  equation:  CE  =  —  cm  1°S  £  cm 

We  use  a  commonly  practiced  method  as  the  baseline 
for  comparison:  first  projecting  the  multiple  sequences  into 
low  dimensional  principal  components  (#dim=#class=2) 
and  then  clustering  by  k-means  with  Euclidean  distance. 
Tab.  4(a)  and  4(b)  show  the  confusion  matrices  and  their 


Table  4:  Clustering  on  MOCAP  right  foot  marker  z- 
coordinate:  Confusion  matrix  and  conditional  en¬ 
tropy.  Note  the  ideal  confusion  matrix  will  be  di¬ 
agonal,  which  has  conditional  entropy  of  0.  Note  in 
both  way  PLiF  wins. 
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Figure  5:  PLiF- 
clustering  on 

BGP  traffic  data. 
Note  how  geo¬ 
graphically  close 
routers  have 

been  clustered 
together. 


conditional  entropies  from  the  predicted  grouping  by  the 
baseline  and  by  PLiF  clustering  respectively.  Note  while 
baseline  makes  nearly  random  guesses,  our  method  could 
identify  all  walking  and  almost  all  running  motions  correctly. 
The  only  three  (out  of  49)  mistakes  by  PLiF  turn  out  to  be 
two  running  to  stop  motions  and  a  right  turn.  As  a  typi¬ 
cal  example  in  Fig.  4(a),  those  mistakes  have  a  very  similar 
pattern  with  walking  motion,  so  that  even  human  would  be 
confused. 

As  an  exploratory  example,  we  use  PLiF-clustering  to  find 
groups  on  BGP  data  -  we  do  not  have  ground-truth  labels  of 
each  sequence  here.  Fig.  5  shows  the  results  (each  clus¬ 
ter  is  shown  encircled).  Note  that  the  results  match  well 
with  the  notion  that  geographically  closer  routers  tend  to 
be  more  correlated  than  others.  This  is  because  the  BGP 
routing  protocol  itself  tries  to  find  shorter  routes  which  re¬ 
sults  in  packets  being  sent  locally  to  nearby  routers  rather 
than  routers  far  away.  Thus  closer  routers  may  have  time 
shifts  and  correlations  that  are  captured  by  PLiF. 

4.3  Compression 

The  fingerprints  extracted  by  PLiF  could  be  used  in  a 
compression  setting  as  well.  The  basic  idea  is  to  store  the 
eigen-dynamics  matrix  (A),  its  associated  projection  matrix 
(C h)  and  a  subset  of  expected  value  of  hidden  variables. 
From  Sec.  3.2,  the  eigen-dynamics  A  is  a  diagonal  matrix,  so 
we  only  keep  the  diagonal  part.  We  also  keep  E [zi\  computed 
from  the  E-step  of  EM  algorithm  for  LDS.  To  be  able  to 
recover  from  compression,  we  compute  the  hidden  values 
using  fli  =  V*-E[£i].  PLiF-compression  finds  a  subset  of  J  C 
{ 1 , . .  . ,  T} ,  determining  which  time  tick  of  hidden  values 
will  be  stored.  Here  we  use  a  similar  idea  as  DynaMMo 
compression  [21]  to  select  the  best  subset  of  time  tick  index 
using  dynamics  programming.  To  recover  the  original  signal, 
we  project  back  the  data  matrix  from  those  hidden  variables 
and  dynamics  using  the  following  equations: 

Xi  —  Gh  '  (8) 

flj  —  AJ~*/2i  if  *  €  J  A  i  +  1, . . . ,  j  ^  J  (9) 


(a)  MOCAP  walking(#22)  (b)  CHLORINE 

Figure  6:  Compression:  normalized  reconstruc¬ 
tion  error  versus  compression  ratio.  Note  PLiF 
achieves  up  to  three  times  better  than  state-of-the- 
art  method  DynaMMo  compression. 


(a)  CHLORINE  (b)  BGP 


Figure  7:  PLiF  computation  time  versus  the  length 
of  sequences  on  CHLORINE  and  BGP  datasets:  linear  as 
expected. 


We  did  compression  experiments  on  both  MOCAP  and  CHLORINE 
data  and  evaluated  the  quality  by  relative  error  defined 

i  •  mse(X-X)-m  i  i 

as:  relative  error  =  ^  v ,  where  mse  denotes  mean 

Li  var(Xi) 

square  error  and  var  variance  for  each  sequence.  Fig.  6(a) 
and  6(b)  show  respectively  PLiF-compression  results  for  a 
walking  motion  (#22)  and  CHLORINE  compared  with  PCA 
and  DynaMMo  [21],  Note  here  the  statistics  are  generated 
by  varying  over  different  h  and  number  of  time  ticks  of  hid¬ 
den  variables  to  keep,  and  we  only  plot  the  skyline  of  com¬ 
pression  ratio  and  error. 

4.4  Scalability 

We  now  evaluate  the  scalability  of  PLiF  on  both  MOCAP 
and  CHLORINE  data.  We  took  various  sizes  of  the  CHLORINE 
sequences  (by  truncation)  to  test  the  scalability  with  respect 
to  the  length  and  the  number  of  sequences. 

Fig.  7(a)  and  7(b)  show  the  wall  clock  time  of  PLiF  with 
respect  to  the  length  of  sequences,  on  five  different  number 
of  sequences  from  CHLORINE  data,  and  on  10  sequences  from 
BGP  data  (after  taking  the  logarithm).  In  each  experiment, 
we  set  the  number  of  hidden  variables  h  =  15  for  CHLORINE 
and  h  =  10  for  BGP  and  the  learning  step  runs  at  the  same 
number  of  iterations  (=  20).  In  Fig.  7(a)  and  7(b),  all  wall 
clock  times  fall  in  to  almost  straight  line,  indicating  the 
linear  scalability  of  PLiF  over  the  length  of  sequences. 

We  did  experiment  on  MOCAP  dataset  to  compare  the  speed 
of  PLiF  and  PLiF-naive.  Fig.  8  presents  wall  clock  time  on 


Figure  8:  Wall 

clock  time  of  PLiF 
versus  PLiF-naive 
on  and  MOCAP: 
upto  3x  gains. 
Similarly  experi¬ 
ment  on  CHLORINE 
dataset  obtains 
3.5x  speedup. 

three  typical  MOCAP  sequences,  one  walking  motion  (#22), 
one  jumping  motion  and  one  running  motion  (#45).  PLiF 
is  3  times  faster  PLiF-naive.  Experiments  on  the  CHLORINE 
dataset  reveals  similarly,  PLiF  scales  much  better  than  the 
basic  algorithm  PLiF-naive  over  the  number  of  sequences 
and  gets  up  to  3.5  times  faster  than  the  latter. 

5.  CONCLUSIONS 

The  main  idea  is  the  proposal  and  design  of  PLiF,  for  the 
extraction  of  “ fingerprints ”  from  a  collection  of  co-evolving 
time  sequences.  PLiF  has  all  of  the  following  desirable  char¬ 
acteristics, 

1.  Effectiveness:  The  resulting  features  correspond  to  mem¬ 
bership  weights  in  each  harmonics  group;  thus,  they 
capture  correlations,  despite  the  presence  of  lags,  and 
despite  small  shifts  in  frequency.  The  resulting  dis¬ 
tance  function  agrees  with  human  intuition  and  the 
provided  ground  truth.  Thus,  fingerprints  lead  to  good 
clustering,  as  well  as  visualization. 

2.  Interpretability:  The  fingerprints  correspond  to  groups 
of  harmonics,  which  are  easy  to  interpret. 

3.  Forecasting:  PLiF  can  easily  do  forecasting,  since  it 
is  based  on  linear  dynamical  systems  and  their  cor¬ 
responding  difference  equations.  Thus,  it  can  easily 
do  forecasting  and  compression,  outperforming  SVD 
and  state-of-the-art  compression  methods  (see  Fig¬ 
ure  6(a), 6(b)). 

4.  Scalability:  PLiF  is  fast  and  scalable,  being  linear  on 
the  length  of  the  sequences. 

We  showed  the  basic  version  of  PLiF,  as  well  as  the  final 
one.  Both  are  linear  on  the  length  of  sequences,  but  PLiF 
can  be  up  to  3.5  times  faster,  thanks  to  our  Lemma  5. 

Future  work  could  focus  on  testing  PLiF’s  performance 
on  additional  datasets  and  its  use  for  segmentation  and 
anomaly  detection,  which  as  we  mentioned  are  natural  by¬ 
products  of  any  method  that  can  do  forecasting.  One  lim¬ 
itation  of  the  current  proposed  method  is  the  inability  to 
handle  sequences  of  non-uniform  lengths.  In  such  cases, 
the  naive  way  of  truncating  sequences  wastes  part  of  data. 
Hence  further  work  may  also  target  extending  PLiF  to  clus¬ 
ter  time  series  with  different  lengths. 
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APPENDIX 

A.  ADDITIONAL  RELATED  WORK 

There  is  a  lot  of  work  on  time  series  analysis,  on  indexing, 
clustering,  and  forecasting. 

Indexing,  Signals  and  Streams:  For  indexing,  the  idea  is 
to  extract  features  [3]  and  then  use  a  spatial  access  method. 
Typical  features  include  the  Fourier  transform  coefficients, 
wavelets  (Gilbert  et  ah,  [8],  Jahangiri  et  al.  [12])  piece-wise 
linear  approximations  (Keogh  et  al.  [18]). These  are  mainly 
useful  for  the  Euclidean  distance,  or  variations  (Rafiei  et 
al  [26],  Ogras  et  al  [23]).  Indexing  for  motion  databases  has 
also  attracted  attention,  both  in  the  database  community 
(eg.,  Keogh  et  al.  [19])  as  well  as  in  graphics  (Safonova  et 
al.  [28]). 

The  typical  distance  function  is  the  Euclidean  distance; 
the  other  major  competitor  is  the  time  warping  distance, 
also  known  as  Dynamic  Time  Warping  (DTW)  (e.g.,  see 
the  tutorial  by  Gunopulos  and  Das  [10]).  The  linear-time 
constrained  versions  of  DTW  (Itakura  parallelogram,  Sakoe- 
Chiba  band)  have  been  studied  in  [18,  4[. 

There  is  also  vast,  recent  literature  on  indexing  moving 
objects  (Jensen  et  al.  [14]  Mouratidis  et  al.  [22]),  as  well 
as  streams  (e.g.,  see  the  edited  volume  by  Garofalakis  et 
al.  [6]).  An  additional  recent  application  for  time  series  is 
monitoring  a  data  center  (eg.,  Reeves  et  al.  [27]),  where 
the  goal  is  to  observe  patterns  in  order  to  minimize  energy 
consumption.  An  equally  important  monitoring  application 
is  environmental  sensors  (e.g.,  Deshpande  et  al.  [2]). 

Dimensionality  reduction:  There  are  numerous  papers  on 
the  topic,  with  typical  methods  being  PCA  [15],  SVD/LSI  [5] 
and  random  projections  [24]. 

Autoregression:  Autoregression  is  the  standard  first  step 
for  forecasting.  It  is  part  of  the  ARIMA  methodology,  pi¬ 
oneered  by  Box  and  Jenkins  [1],  and  it  discussed  in  every 
textbook  in  time  series  analysis  and  forecasting.  Kalpakis 
et  al  [17]  used  autoregression  to  extract  features,  using  the 
so-called  cepstrum  method  from  voice  processing.  Kalman 
filters  and  Linear  Dynamical  Systems  are  closely  related  to 
autoregression,  trying  to  detect  hidden  variables  (like  veloc¬ 
ity,  acceleration)  at  every  time-tick,  and  use  them  for  fore¬ 
casting.  In  the  database  community,  Kalman  filters  have 
been  proposed  for  sensor  data  (Jain  et  al  [13])  as  well  as  for 
moving  objects  (Tao  et  al  [30]). 


All  the  above  approaches  are  powerful  and  very  popu¬ 
lar  for  their  intended  problem.  In  fact,  the  proposed  PLiF 
method  uses  some  of  them  as  stepping  stones  (LDS,  PCA). 
However,  none  of  them  achieves  all  the  goals  we  set  in  the 
introduction. 

B.  SPECIAL  CASES  &  THEIR  SHORTCOM¬ 
INGS 

There  are  several  existing  methods,  but  none  matches  all 
the  desirable  properties  illustrated  in  Table  1.  Thus,  none 
is  a  head-on  competitor  to  our  proposed  PLiF  method.  We 
elaborate  on  PCA,  Discrete  Fourier  Transform  and  Linear 
Dynamical  Systems  here  because  (a)  they  are  the  typical 
competitors  for  some  (but  not  all)  of  the  target  tasks  and 
(b)  they  can  help  in  describing  our  PLiF  method  as  well. 

B.l  Principal  Component  Analysis 

Principal  Component  Analysis  (PCA)  is  the  textbook  method 
of  doing  dimensionality  reduction,  by  spotting  redundancies 
and  (linear)  correlations  among  the  given  sequences.  Tech¬ 
nically,  it  gives  the  optimal  low  rank  approximation  for  the 
data  matrix  X.  In  our  running  example  of  Section  2,  the 
matrix  X  would  be  a  5  x  500  matrix,  with  one  row  for  each 
sequence  and  one  column  per  time-tick.  Singular  value  de¬ 
composition  (SVD)  is  the  typical  method  to  compute  PCA. 

For  a  data  matrix  X  (assume  X  is  zero-centered),  SVD  com¬ 
putes  the  decomposition 

X  =  U  •  S  ■  VT 

where  both  U  and  V  are  orthonormal  matrices,  and  S  is  a 
diagonal  matrix  with  singular  values  on  the  diagonal.  Using 
standard  terminology  from  the  PCA  literature,  V  is  called 
the  loading  matrix  and  U  ■  S  will  be  the  component  score. 

To  achieve  dimensionality  reduction,  small  singular  values 
are  typically  set  to  zero  so  that  the  retained  ones  maintain 
80-90%  of  the  energy  (=  sum  of  squares  of  eigenvalues).  We 
shall  refer  to  this  rule  of  thumb  as  the  energy  criterion  [5] 
(equivalently,  this  truncation  produces  a  low  rank  projec¬ 
tion).  In  our  running  example  of  Figure  2(a),  the  U  ■  S  com¬ 
ponent  score  matrix  is  a  5x2  matrix,  since  we  are  retaining 
2  hidden  variables. 

PCA  is  effective  in  dimensionality  reduction  and  in  find¬ 
ing  linear  correlations,  particularly  for  Gaussian  distributed 
data  [31].  However,  the  low  dimensional  projections  are  hard 
to  interpret.  Moreover,  PCA  can  not  capture  time-evolving 
patterns  (since  it  is  designed  to  not  care  about  the  ordering 
of  the  rows  or  the  columns),  and  thus  it  can  not  do  fore¬ 
casting.  Fig.  2(b)  shows  the  top  two  principal  components 
for  the  synthetic  five  sequences.  It  does  not  show  any  clear 
pattern  of  underlying  clusters;  thus  k-means  indeed  gives  a 
poor  final  clustering  result  on  it. 

B.2  Discrete  Fourier  Transform 

The  T-point  Discrete  Fourier  Transform  (DFT)  of  sequence 
(xo,  ■  ■  ■  ,Xt- i)  is  a  set  of  T  complex  numbers  Ck,  given  by 
the  formula 

t-i 

ck  =  EIte ~^kt  fc  =  0, . . . ,  T  —  1 

t= o 

where  i  =  \J—  1  is  imaginary  unit. 

The  Ck  numbers  are  also  referred  to  as  the  spectrum  of  the 
input  sequence.  DFT  is  powerful  in  spotting  periodicities  in 


a  single  sequence,  with  numerous  uses  in  signal,  voice,  and 
image  processing.  However,  it  is  not  clear  how  to  assess  the 
similarity  between  two  spectra,  and  hence  DFT  can  be  un¬ 
suitable  for  clustering.  Moreover,  it  has  several  limitations, 
namely: 

1.  it  can  not  find  arbitrary  frequencies  (only  ones  that  are 
integer  multiples  of  the  base  frequency), 

2.  it  can  not  give  partial  credit  for  signals  with  nearby 
frequences  (‘frequency  proximity’,  Property  P2  in  the 
introduction), 

3.  it  can  not  do  forecasting,  other  than  blindly  repeating 
the  original  signal. 

Due  to  those  limitations,  we  do  not  compare  PLiF  against 
DFT,  in  the  experiments  section  under  clustering. 

B.3  Linear  Dynamical  Systems 

Linear  Dynamical  Systems  (LDS),  also  known  as  Kalman 
Liters,  have  been  used  previously  to  model  multi-dimensional 
continuous  valued  time  series.  The  model  is  described  by  the 


following  equations: 

Zl  = 

Mo  + 

(10) 

Zn-\-l  — 

A-Zn  +  Wn+1 

(ii) 

Xn  = 

C  Zn  + 

(12) 

where  Mo  is  initial  state  of  the  whole  system. 

The  model  assumes  the  observed  data  sequences  (f„)  are 
generated  from  the  a  series  of  hidden  variables  (zn)  with 
a  linear  projection  matrix  C,  and  the  hidden  variables  are 
evolving  over  time  with  linear  transition  matrix  A,  so  that 
next  time  tick  only  depends  on  the  previous  time  tick  as 
in  Markov  chains.  All  noises  (tD’s  and  e’s)  arising  from  the 
process  are  modeled  as  independent  Gaussian  noises  with 
covariances  Qo,  Q  and  R  respectively.  Given  the  observa¬ 
tion  series,  there  exist  algorithms  for  estimating  hidden  vari¬ 
ables  [16]  and  EM  algorithms  for  learning  the  model  param¬ 
eters  [29,  7],  with  publicly  available  implementations4.  The 
EM  algorithm  maximizes  L(9),  the  expected  log-likelihood 
defined  in  Eq.  13,  iteratively.  In  the  E  step,  it  estimates  the 
posterior  distribution  of  the  hidden  variables  conditioned  on 
the  data  sequence  with  fixed  model  parameters;  in  the  M 
step,  it  then  updates  the  model  parameters  by  maximizing 
the  likelihood  using  some  sufficient  statistics  (e.g.  mean  and 
covariance)  from  the  posterior  distribution. 

L(6\X)  =  EXtz\e[~ D(z\,  mo,  Qo) 

T  T 

-  ^2  D(zt,  Azi_i,  Q)  -  ^2  D(xt,  C zt,  R) 

t= 2  t=  1 

log  IQol  -  ^  log  IQ!  -  \  log  |R|](13) 

where  D()  is  the  square  of  the  Mahalanobis  distance,  i.e. 
D(x,y,  E)  =  (x  —  y)T'£~1(x  —  y). 

The  difference  between  our  proposed  PLiF  and  LDS  is 
that  in  addition  to  learning  straight  forward  transitions  and 
projections,  PLiF  will  further  discover  deeper  patterns  be¬ 
hind  them.  The  problem  with  LDS  learning  (see  Fig.  2(c)) 
is  that  the  learned  model  parameters  are  hard  to  interpret. 

4http://people.cs.ubc.ca/~murphy/software/kalman/ 

kalman.html 


C.  EXPERIMENTS  AND  ALGORITHM  DE¬ 
TAILS 

C.l  Proof  Sketches 

Lemma  1.  Consider  the  eigenvalue  equation  A  •  x  =  Xx, 
where  x  is  the  eigenvector.  Taking  the  conjugate  on  both 
sides  we  get  A  ■  x  =  Xx.  As  A  contains  only  real  entries, 
A-x  =  Xx.  Hence,  if  the  conjugate  A  is  an  eigenvalue  of  A, 
x  is  also  a  corresponding  (conjugate)  eigenvector.  □ 

Lemma  2. 

-(new  -x  r*  -» 

Zn  =  V  •  Zn 

=  V*  ■  A  •  zn- 1  +  noise 
=  V*  ■  V  ■  A  ■  V*  •  zn- 1  +  noise 

A  -mew  , 

•  zn~  i  +  noise 

xi  =  C  ■  po  +  noise  =  C  ■  V  ■  V*  •  mo  +  noise 
=  Ch  •  Mo  +  noise 
X2  =  C  ■  22  +  noise  =  C  •  A  •  zi  +  noise 
=  C  ■  V  ■  A  ■  V*  •  /Jo  +  noise 

C*  -*new  , 

h  ■  A  •  Mo  +  noise 

The  result  then  follows  by  induction  on  the  number  of  time 
ticks.  □ 

C.2  Experimental  Setup 

We  describe  here  our  experimental  setup  and  datasets. 

•  Mocap  data  (MOCAP):  Motion  capture  involves  record¬ 
ing  human  motion  through  tracking  the  marker  move¬ 
ment  on  human  actors  and  then  turn  them  into  a  se¬ 
ries  of  multi-dimensional  coordinates  in  3d  space.  We 
use  a  publicly  available  mocap  dataset  from  CMU5. 
It  includes  49  walking  and  running  motions  of  sub- 
ject#16.  Each  motion  sequence  contains  93  positions 
for  31  markers  in  body  local  coordinate  and  three  ref¬ 
erence  coordinates. 

•  Chlorine  Data  (CHLORINE):  The  chlorine  dataset  is  pub¬ 
licly  available6  and  it  contains  m=166  sequences  of 
Chlorine  concentration  measurements  on  a  water  net¬ 
work  over  15  days  at  the  rate  of  one  sample  per  5  min¬ 
utes  (T=4310  time  ticks).  The  dataset  was  produced 
by  the  EPANET  2  hydraulic  analysis  package7,  which 
reflects  periodic  patterns  (daily  cycles,  dominating  resi¬ 
dential  demand  pattern)  in  the  Chlorine  concentration, 
with  a  few  exceptions  and  time  shifts  (see  sample  se¬ 
quences  in  Fig.  9(a)). 

•  Router  Data  (BGP):  We  examine  BGP  Monitor  data 
containing  18  million  BGP  update  messages  over  a  pe¬ 
riod  of  two  years  (09/2004  to  09/2006)  from  the  Data- 
pository  project8.  We  consider  the  number  of  updates 
received  by  a  router  every  10  minutes.  A  snippet  is 
shown  in  Fig.  9(b).  As  the  signals  are  very  bursty,  we 
take  their  logarithms  (see  Fig.  9(c)).  The  preprocessed 

5http:  / /mocap. cs.cmu.edu 
6  www.cs.cmu.edu/afs/cs/project  /  spirit- 
l/www/data/cl2fullLarge.zip 

'  http:/ /www. epa.gov/nrmrl/wswrd/dw/epanet.html 
8  http:  / /www.  datapository.net /bgpmon  / 


Figure  9:  Sample  snippets  from  datasets,  (a)  CHLORINE  shows  daily  periodicity;  (b)  BGP  is  bursty  with  no 
periodicities,  thus  we  take  the  logarithm  (shown  in  part  (c)).  No  obvious  patterns,  in  neither  (b)  nor  (c). 


BGP  time  series  in  the  experiment  consists  of  m=10  se¬ 
quences  (routers)  of  T=103,968  time  ticks.  The  routers 
are  in  10  major  centers  (Atlanta,  Washington  DC,  Seat¬ 
tle  etc.). 

Our  algorithms  are  implemented  in  Matlab  2008b,  and 
running  on  a  machine  with  Windows  XP,  3.2GHz  dual  core 
CPU  and  2G  RAM. 

C.3  Proposed  PLiF:  Scaling  Up 


Algorithm  1:  PLiF 

Input:  X:  m  sequences  with  duration  T,  and  k 
Output:  fingerprints  F 

1  choose  ft  by  80%-95%  energy  criterion  ; 

//  learning  dynamics  (see  Sec. 3.1  and  Eq.13) 

2  A,  C  <—  arg  maxg  L(9;  X)  ; 

//  canonicalization,  see  Sec. 3. 2 

3  compute  A,  V  s.t.  A  ■  V  =  V  ■  A; 

//  compensating,  see  Sec. 3. 2 

4  Ch  =  C  ■  V; 

//  obtain  polar  form,  see  Sec. 3. 3 

5  D  <—  keep  conjugate  pairs  of  columns  in  Ch', 

6  E  <—  take  element-wise  magnitude  of  D; 

7  Cm  <—  eliminate  duplicate  columns  in  E; 

//  finding  harmonics  grouping,  see  Sec. 3. 4 
3  Cm  *  Cm  mean(Cm), 

9  compute  Ufe,  Sfe,  Vfe  <—  argmin||Cm  —  Ufe  •  Sk  ■  V)T \\fro 
10  F  <—  Us,  •  Sk; 


We  refer  to  the  algorithm  given  in  Section  3  as  PLiF-naive 
(pseudo-code  given  in  Alg.  1),  which  gives  the  intuitive  idea 
and  motivation  for  each  algorithmic  step.  In  this  section 
we  will  focus  on  the  scalability  of  the  algorithm  and  pro¬ 
pose  a  faster  implementation.  Since  PLiF-naive  involves  a 
fair  amount  of  matrix  inversion,  it  is  cubic  to  the  size  of 
matrix  of  interest.  To  speed  up  the  algorithm,  our  idea  is 
to  perform  smarter  and  faster  matrix  inversion  instead  of 
straight-forward  inversion.  We  will  first  analyze  the  com¬ 
plexity  of  PLiF-naive.  We  make  an  assumption  here  that 
the  length  of  data  sequence  is  much  greater  than  the  dimen¬ 
sion,  T  m,  which  is  the  common  cases  as  otherwise  the 
resulting  LDS  model  will  be  under-determined. 

Proof  Sketch  of  Lemma  4.  :  In  each  iteration  of  learning 
LDS  (Alg.  1,  step  2),  it  does  mxm  and  ftx  ft.  matrix  inversion 
for  T  length  of  sequences.  Rest  all  steps  including  eigen 


value  decomposition  in  canonicalization,  taking  polar  form 
and  grouping  harmonics  with  SVD,  take  at  most  0(m3). 
Therefore,  PLiF-naive  has  complexity  of  Obliteration  ■  T  ■ 

(m3  +  ft3)).  □ 

The  time  complexity  of  PLiF-naive  is  linear  with  respect 
to  the  length  of  sequences,  but  cubic  to  the  number  of  se¬ 
quences.  Without  going  too  much  into  the  details,  we  de¬ 
scribe  briefly  where  such  cubic  complexity  arises  in  PLiF- 
naive.  The  major  cubic  computation  involves  calculation  of 
the  inverse  of  the  following  mxm  matrix  while  learning 
LDS  in  the  first  step: 

(CP,jCT  +  R)_1  (14) 

Here  C  is  size  m  x  ft,  P„  is  h  x  h,  and  R  is  m  x  m.  P„  is  a 
complicated  matrix,  hence  we  omit  details  of  P„  for  brevity 
(full  details  can  be  found  in  [16]).  This  is  updated  in  each 
time  tick  while  learning  the  LDS  model  (Alg.  1,  step  2). 
Cubic  computation  elsewhere  is  only  a  negligible  fraction  in 
the  typical  case  of  m  <  T. 

How  to  scale  up:  We  will  focus  on  speeding  up  the  com¬ 
putation  of  the  inversion.  The  idea  underlying  our  method 
is  using  the  Woodbury  matrix  identity  [9]  to  perform  an 
inverse  on  a  smaller  matrix  ( ft  x  ft.)  instead  of  direct  large 
matrix  (m  x  m)  inversion.  In  typical  cases,  h  is  much  smaller 
than  m ,  therefore  it  will  be  significantly  faster.  We  can  then 
substitute  the  matrix  inversions  in  Alg.  1  step  2  with  faster 
ones.  We  will  refer  to  this  faster  version  as  PLiF. 

Lemma  5.  PLiF  can  be  computed  within  time  of  Obliteration- 
T  ■  (m2  •  ft  +  h3)  +  m  ■  ft2). 

Proof  Sketch.  :  We  will  analyze  Alg.  1  step  by  step.  The 
Woodbury  formula  tells  the  following  identity  for  invertible 
X: 

(X+YZYT)_1  =  X_1-X'1Y(Z“1-|-YtX”1Y)_1YtX~1 

By  applying  the  identity,  we  obtain  the  following  equation 
for  computing  Eq.  14: 

R1  -  R^CXP,;1  +  CTR~1C)~1CTR~1  (15) 

Since  C  and  R  are  fixed  inside  each  iteration  and  only  up¬ 
dated  once  at  the  end  of  each  iteration,  we  can  therefore  pre¬ 
compute  R_1,  R  3C  and  CrR-1C  in  the  beginning  of  each 
learning  iteration.  For  each  time  tick  of  the  data  sequence, 
the  EM  algorithm  for  learning  LDS  requires  the  inversion 
of  two  ft  x  ft.  matrices  and  multiplication  of  m  x  ft.,  hxh 
and  ft  x  m  matrices.  Thus  it  takes  0(T  ■  (m2ft  +  ft3))  during 
each  iteration  of  learning  the  LDS  assuming  T>ro,  and  it 


follows  that  Alg.  1  step  2  takes  0(#iteration-T(m2  h+ h3)) . 

In  addition,  Alg.  1  step  3  takes  0(h3),  step  4  takes  0(mh2), 
and  step  10  takes  0(mh 2).  □ 


